Abstract

Large intergenic non-coding RNAs (lincRNAs) are a new class of functional transcripts, and aberrant expression of lincRNAs
has been associated with several human diseases. Genetic variants in lincRNA transcription factor binding sites (TFBSs) can
change lincRNA expression, thereby affecting susceptibility to human diseases. To identify and annotate these
functional candidates, we have developed a database, SNP@lincTFBS, which is devoted to the exploration and annotation of
single nucleotide polymorphisms (SNPs) in potential TFBSs of human lincRNAs. We identified 6,665 SNPs in 6,614 conserved
TFBSs of 2,423 human lincRNAs. In addition, using ChIP-Seq datasets, we identified 139,576 SNPs in 304,517 transcription factor
peaks of 4,813 lincRNAs. We also performed comprehensive annotation of these SNPs using 1000 Genomes Project
datasets across 11 populations. Moreover, one of the distinctive features of SNP@lincTFBS is the collection of disease-associated
SNPs in lincRNA TFBSs and SNPs in the TFBSs of disease-associated lincRNAs. The web interface enables both
flexible data searches and downloads. Quick searches can be made by lincRNA name, SNP identifier, or transcription factor
name. SNP@lincTFBS provides significant advances in identification of disease-associated lincRNA variants and improved
convenience in interpreting the discrepant expression of lincRNAs. The SNP@lincTFBS database is available at
http://bioinfo.hrbmu.edu.cn/SNP_lincTFBS.

Introduction

Large intergenic non-coding RNAs (lincRNAs) have recently
emerged as a novel class of functional non-coding RNAs. They
are more than 200 nucleotides in length, derive from the intervals
between protein-coding genes, have a similar exon-intron-exon
structure, but lack protein-coding capacity [1]. The
number of identified human lincRNA transcripts continues to
increase [2], and many of them have been found to play important
roles in multiple biological processes, including epigenetic
regulation of protein-coding gene expression [3-5] and crucial
functions in development [6]. Emerging evidence has also
demonstrated that numerous lincRNAs are associated with a
wide range of human diseases [7].

Recently, several profiling studies have revealed that dysregulated
expression of lincRNAs is involved in several forms of
human cancer [8]. For example, one study reported that the
expression level of the lincRNA PCGEM1 was higher in prostate
tumor specimens than in matched normal tissues [9]. The lincRNA
HOTAIR (HOX antisense intergenic RNA) can be regarded as an
independent cancer prognostic marker because of its significant
overexpression in breast cancer, hepatocellular cancer, colorectal
cancer and laryngeal squamous cell carcinoma [10-12]. Another
highly abundant lincRNA, MALAT1 (also known as NEAT2), was
originally identified as a marker for lung cancer metastasis; its
expression is strongly regulated in many tumor entities, including
lung adenocarcinoma and hepatocellular carcinoma [13,14]. In
addition, it has been demonstrated that up-regulation of the
lincRNA HULC is highly associated with the incidence of
hepatitis B virus (HBV) infection [15]. However, although a number
of lincRNAs show aberrant expression in disease states, the
causes of altered lincRNA expression abundance have
yet to be completely understood.

Previous studies have shown that single nucleotide polymorphisms
(SNPs) in transcription factor binding sites (TFBSs) of
protein-coding genes can affect gene expression by altering
transcription factor binding, and thereby participate in human diseases
[16-20]. A recent study of a tumor suppressor lincRNA also
demonstrated that a SNP (rs944289) could predispose to papillary
thyroid carcinoma by dysregulating lincRNA (PTCSC3)
expression through decreased binding activity of both C/EBPα and
C/EBPβ [21]. Thus, SNPs in human lincRNA TFBSs can act
as a set of functional variants that may disrupt transcription
factor binding, resulting in variation in lincRNA expression
and, potentially, diverse diseases.

Furthermore, with the advent of high-throughput technologies,
large-scale lincRNA annotation data, SNP data, and predicted and
experimentally supported TFBS data have been generated. This
provides a great opportunity to systematically identify SNPs in
human lincRNA TFBSs. For example, in the latest update of the
NONCODE database, the lincRNA data set was expanded by
collecting newly identified lincRNAs from the published literature
and integrating the latest versions of RefSeq and Ensembl [22].
The LncRNADisease database collects experimentally supported
lncRNA-disease associations and lncRNA interacting partners at
various molecular levels [23]. The ChIPBase database was developed
to annotate and identify TFBSs and transcriptional regulatory
relationships of lncRNAs and miRNAs from ChIP-Seq data [24].
In addition, the ENCODE project has compiled a large number of
ChIP-Seq experiments for many human TFs in different cell lines
and tissues [25]. Enriched peak regions from these ChIP-Seq data
can be mapped to the promoter regions of lincRNAs, which
facilitates the discovery of experimentally supported TFBSs of
human lincRNAs in different cell lines and tissues and also gives us
a better opportunity to identify SNPs in lincRNA TFBSs for a cell
line of interest.

Therefore, to provide a useful annotation of these potential
functional variants in human TFBSs, we developed the
SNP@lincTFBS database for integrating and annotating functional
SNPs in predicted lincRNA TFBSs. We identified 6,665
SNPs occurring in 6,614 TFBSs of 2,423 human lincRNAs, and
provide a comprehensive resource of candidate SNPs
relevant to the aberrant expression of lincRNAs. The
SNP@lincTFBS database will help to identify functional
SNPs of lincRNAs at the level of transcription and will contribute to
the in-depth study of complex diseases.

Materials and Methods 

Human lincRNA data 

We obtained 6,631 human lincRNAs with genomic coordinates
from the lincRNA list of the GENCODE project (version 16) [26],
and removed lincRNAs without a unique, well-defined chromosomal
location. Finally, 5,835 lincRNAs were included in
SNP@lincTFBS.

Identifying conserved TFBSs of human lincRNAs 

We downloaded the locations and scores of conserved TFBSs
from the UCSC genome browser [27]. These data were obtained
by running the program tfloc (Transcription Factor binding site
LOCater) on multiz46way alignments, restricted to the July
2007 mouse genome assembly (mm9), the November 2004 rat
assembly (rn4), and the February 2009 human genome assembly
(hg19). A binding site is considered to be conserved across the
alignment if its score meets the threshold score for its binding
matrix in all 3 species (human, mouse and rat). Transcription
factor information was downloaded from the Transfac Factor
database, and the score and threshold were computed with the
Transfac Matrix Database (v7.0) created by Biobase [28]. Then,
following a previous study [29], we defined the region from 5 kb
upstream to 1 kb downstream of the transcription start site of each
lincRNA as its promoter region. We identified the conserved TFBSs of
human lincRNAs in these regions; as a result, we identified 33,181
TFBSs in the defined promoter regions of 3,839 human lincRNAs.
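
The promoter window and the three-species conservation rule described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the names `Interval`, `promoter_region`, `is_conserved` and `site_in_region` are hypothetical, coordinates are assumed to be 1-based and inclusive, and requiring a site to lie fully inside the window is a choice made for the example. Only the 5 kb/1 kb window and the "threshold met in all 3 species" rule come from the text.

```python
from typing import Mapping, NamedTuple

UPSTREAM = 5_000    # 5 kb upstream of the TSS (from the text)
DOWNSTREAM = 1_000  # 1 kb downstream of the TSS (from the text)
SPECIES = ("human", "mouse", "rat")


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def promoter_region(chrom: str, tss: int, strand: str) -> Interval:
    """Return the promoter window around a TSS, taking the strand into account."""
    if strand == "+":
        start, end = tss - UPSTREAM, tss + DOWNSTREAM
    elif strand == "-":
        # On the minus strand, "upstream" means larger genomic coordinates.
        start, end = tss - DOWNSTREAM, tss + UPSTREAM
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return Interval(chrom, max(1, start), end)  # clip at the chromosome start


def is_conserved(scores: Mapping[str, float], threshold: float) -> bool:
    """A site is conserved only if every species meets the matrix threshold."""
    return all(scores[species] >= threshold for species in SPECIES)


def site_in_region(region: Interval, site: Interval) -> bool:
    """Assumption: a TFBS belongs to a promoter if it lies fully inside it."""
    return (region.chrom == site.chrom
            and region.start <= site.start
            and site.end <= region.end)
```

Handling the strand explicitly matters because a fixed "TSS minus 5 kb to TSS plus 1 kb" window would cover mostly the gene body for minus-strand lincRNAs.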

Identifying TFBSs of lincRNA using genome-wide ChIP-Seq data

We downloaded 690 ChIP-Seq datasets for 169 human
transcription factors in different cell lines and tissues from
the ENCODE project [25]. These peak datasets were computed by
a peak calling method (PeakSeq), which identifies enriched peaks
by comparing each ChIP-Seq dataset to the corresponding
control experiment [30]. Then, we identified the peaks that were
located in the promoter regions of human lincRNAs (5 kb
upstream to 1 kb downstream of the transcription start
site of each lincRNA). In total, we identified 323,256 transcription
factor peaks of different transcription factors in 4,831
lincRNA promoter regions.
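
Assigning ChIP-Seq peaks to lincRNA promoters is essentially an interval-overlap problem. The sketch below illustrates one way to do it; it is not the authors' implementation. The tuple layouts, the identifiers in the usage example and the decision to count any overlap (rather than full containment) as "located in the promoter" are assumptions made for illustration.

```python
from collections import defaultdict


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def peaks_in_promoters(promoters, peaks):
    """Pair each peak with every promoter it overlaps.

    promoters: iterable of (lincrna_id, chrom, start, end)
    peaks:     iterable of (tf, cell_line, chrom, start, end)
    Returns a list of (lincrna_id, tf, cell_line) tuples.
    """
    # Index promoters by chromosome so each peak is only compared
    # with promoters on its own chromosome.
    by_chrom = defaultdict(list)
    for lincrna_id, chrom, start, end in promoters:
        by_chrom[chrom].append((start, end, lincrna_id))

    hits = []
    for tf, cell_line, chrom, p_start, p_end in peaks:
        for start, end, lincrna_id in by_chrom.get(chrom, []):
            if overlaps(start, end, p_start, p_end):
                hits.append((lincrna_id, tf, cell_line))
    return hits
```

For example, with hypothetical promoters `[("lincA", "chr1", 5000, 11000)]` and a hypothetical peak `("TF1", "cellX", "chr1", 10900, 11200)`, the function reports one lincA/TF1 association. For genome-scale inputs such as the 690 datasets mentioned above, a sorted-interval or interval-tree index would be preferable to this nested loop.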

Identifying SNPs in the TFBSs of human lincRNA 

We downloaded SNPs (common and rare variants) from the public
dbSNP database (build 137) and identified 6,665 SNPs within
6,614 putative TFBSs of 2,423 human lincRNAs. In addition, using the
ChIP-Seq datasets, we identified 139,576 SNPs in 304,517
transcription factor peaks of 4,813 lincRNAs. Then, we downloaded
minor allele frequencies and other annotation information
from the 1000 Genomes Project (release of July 2012) datasets
across 11 populations [31], and performed comprehensive
annotation of these SNPs in lincRNA TFBSs. For each SNP in
a lincRNA TFBS, we also extracted the flanking sequence of 30 nt
upstream and downstream of the SNP position from the RefSeq reference
genomic sequence.
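
The two operations in this subsection, testing whether a SNP falls inside a TFBS and extracting 30 nt of flanking sequence on each side, can be illustrated as follows. This is a hedged sketch: the function names are hypothetical, positions are assumed to be 1-based, and the chromosome sequence is assumed to be available as a plain Python string (for real genomes an indexed FASTA reader would be used instead).

```python
FLANK = 30  # nucleotides on each side of the SNP (from the text)


def snp_in_site(snp_pos: int, site_start: int, site_end: int) -> bool:
    """True if a 1-based SNP position lies inside a closed TFBS interval."""
    return site_start <= snp_pos <= site_end


def flanking_sequence(chrom_seq: str, snp_pos: int, flank: int = FLANK):
    """Return (upstream, reference_base, downstream) around a 1-based SNP position.

    Flanks are truncated at the ends of the sequence instead of padded.
    """
    index = snp_pos - 1  # convert to 0-based string index
    if not 0 <= index < len(chrom_seq):
        raise ValueError(f"position {snp_pos} is outside the sequence")
    upstream = chrom_seq[max(0, index - flank):index]
    downstream = chrom_seq[index + 1:index + 1 + flank]
    return upstream, chrom_seq[index], downstream
```

Returning the reference base separately from the flanks makes it straightforward to substitute each SNP allele and compare the resulting sequences against the TFBS sequence.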

Collecting experimentally supported disease-associated 
SNPs in lincRNA TFBSs 

We manually collected known disease-associated SNPs in
lincRNA TFBSs by searching previous studies in PubMed.
We also annotated lincRNAs in SNP@lincTFBS that have been
reported to be associated with diseases, and identified SNPs within
their putative TFBSs. In addition, we integrated recently reported
well-known disease-associated SNPs and disease lincRNAs into the
SNP@lincTFBS database.

Database implementation 

SNP@lincTFBS is an online query tool developed using the
Eclipse platform for the front end and MySQL as the back-end
database. The web engine was implemented using JSP technology,
the Struts framework and the Java connection pool Proxool, and the web
server was built using Apache Tomcat.

Results 

Overview of the SNP@lincTFBS Database 

We developed a novel integrated database named
SNP@lincTFBS that allows users to perform SNP and TFBS
searches in human lincRNAs. To build this database, we: 1) obtained
human lincRNAs, 2) identified conserved TFBSs and transcription
factor peaks in the defined promoter regions of human lincRNAs, 3)
identified SNPs in lincRNA TFBSs and collected experimentally
supported disease-associated SNPs in lincRNA TFBSs, and 4)
integrated annotation information for SNPs, TFBSs and lincRNAs.
The architecture for identifying SNPs in lincRNA TFBSs is shown
in Figure 1.

PLOS ONE | www.plosone.org | July 2014 | Volume 9 | Issue 7 | e103851

Figure 1. Architecture of SNP@lincTFBS. Data sources: Ensembl Genes 71, UCSC Regulation, ENCODE, dbSNP build 137 and NCBI RefSeq.
Panels: human noncoding genome; transcription factor binding sites (UCSC conserved TFBS (hg19)/ENCODE TF peak within the 6 kb promoter);
single nucleotide polymorphisms (SNP allele vs. TFBS sequence, illustrated with rs181772599 alleles G and A in PCAT-1).

Currently, SNP@lincTFBS contains 8,290 entries of annotated
SNP-TFBS-lincRNA associations, including 3,839 lincRNAs,
33,181 conserved TFBSs, 6,665 SNPs and 165 transcription
factors. In addition, 19,878,236 entries of SNP-peak-lincRNA
associations are stored in SNP@lincTFBS, including 4,831
lincRNAs, 323,256 transcription factor peaks, 139,576 SNPs and
169 transcription factors. We identified a large number of
conserved TFBSs in the promoter regions of human lincRNAs
and found that SNPs were widely distributed in these lincRNA TFBSs
(Figure 2A). Previous studies have shown that each
transcription factor can bind to several TFBSs in the promoter
regions of protein-coding genes, thereby controlling the transcription
of genetic information from DNA to messenger RNA. We
found a similar phenomenon in human lincRNAs: a
transcription factor could bind to many conserved lincRNA
TFBSs (~247 lincRNA), whereas ~20 TFBSs had SNPs
identified within them, and every 5.3 TFBSs had a SNP for
each transcription factor (Figure 2B). In addition, we observed
that SNPs within lincRNA TFBSs were most frequently
located around the lincRNA start site (Figure 2C), suggesting that
these SNPs within lincRNA TFBSs might greatly affect the
expression of lincRNAs.
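
The positional distribution reported for Figure 2C (fraction of SNPs per 1 kb window from 5 kb upstream to 1 kb downstream of the lincRNA start site) can be computed with a simple binning routine. The sketch below is illustrative only: the function name and the input convention (signed, strand-corrected offsets in bp, negative meaning upstream) are assumptions, and the half-open window boundaries are a choice made for the example.

```python
from collections import Counter


def window_fractions(offsets, upstream=5000, downstream=1000, width=1000):
    """Fraction of SNPs falling in each window relative to the start site.

    offsets: signed distances (bp) from the start site, negative = upstream,
             assumed to be already corrected for strand.
    Returns {window_start: fraction}; each window is [window_start, window_start + width).
    Offsets outside [-upstream, downstream) are ignored.
    """
    counts = Counter()
    kept = 0
    for offset in offsets:
        if -upstream <= offset < downstream:
            # Floor division maps e.g. -200 to the window starting at -1000.
            counts[(offset // width) * width] += 1
            kept += 1
    return {
        start: (counts[start] / kept if kept else 0.0)
        for start in range(-upstream, downstream, width)
    }
```

With this convention, a peak in the windows starting at -1000 and 0 would correspond to the enrichment around the start site described in the text.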

Web interface 

The SNP@lincTFBS database website includes seven modules:
home, search, overview, disease lincRNA, GWAS SNP, download
and help (available at http://bioinfo.hrbmu.edu.cn/SNP_lincTFBS).
The HOME page provides a brief description of the
SNP@lincTFBS database; users can browse the high-resolution
flowchart of this work to get the main idea of the database.
The SEARCH page provides a quick search by three kinds of
entries: 1) a lincRNA name (Ensembl ID), 2) an SNP identifier (rs
number from dbSNP), and 3) a transcription factor name. Statistics
of the datasets contained in the database are also presented. The search result
shows lincRNA summary information and all identified TFBSs
and TF peaks in the promoter region of the lincRNA. SNPs in these
TFBSs and TF peaks are listed below (Figure 3). The OVERVIEW
page provides a general overview of the transcription factors stored in
SNP@lincTFBS. The Disease lincRNA page shows existing experimentally
supported disease-associated lincRNAs with their annotations
and internal links to their TFBSs and the SNPs mapped
within them. The GWAS SNP page shows disease-associated SNPs
from GWAS studies that can be mapped to lincRNA
TFBSs; full annotations of the lincRNA and TFBS are also
available through internal links, and an external PubMed link to the relevant
literature is provided. The DOWNLOAD page allows users to
download all data currently provided, including TFBSs and
TF peaks of lincRNA promoter regions and SNPs mapped within
lincRNA TFBSs and TF peaks, in TXT format. The HELP page
provides detailed descriptions of the column labels used in SNP@lincTFBS.
Instructions and contact information are also provided.

Figure 2. SNPs in human lincRNA TFBSs. (A) The number distribution of lincRNAs by chromosome. Blue bars represent all lincRNAs.
Red bars represent lincRNAs that have TFBSs in their promoter regions. Green bars represent lincRNAs that have SNPs in their TFBSs. (B) Statistics of lincRNA
TFBSs with SNPs for each transcription factor. The quantity of lincRNA TFBSs for each transcription factor (left). The quantity of lincRNA TFBSs with
SNPs for each transcription factor (middle). Density of lincRNA TFBSs with SNPs for each transcription factor (right). (C) Distribution of SNPs in lincRNA
TFBSs with respect to distance to the lincRNAs. The x-axis displays the 1 kb window within the region from 5 kb upstream to 1 kb downstream of the start site
of the lincRNA and the y-axis displays the fraction of SNPs in lincRNA TFBSs located within this window.
doi:10.1371/journal.pone.0103851.g002
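
The quick search offered on the SEARCH page (by lincRNA name, SNP identifier or transcription factor name) amounts to an exact-match lookup on one of three columns. The sketch below illustrates that idea only. The real system uses MySQL behind JSP/Struts; here an in-memory SQLite database stands in for it, and the table name `snp_tfbs`, its columns and the `ENSG_DEMO`/`rs_demo`/`TF_DEMO` row are hypothetical. The first two demo rows reuse an association reported elsewhere in this article (rs112175570 in a TFBS of ENSG00000237697); the ASCII spelling `NF-kB` is used for the factor name.

```python
import sqlite3

# Whitelist of searchable columns; the column name is interpolated into SQL,
# so it must never come directly from user input.
SEARCHABLE = {"lincrna", "snp", "tf"}


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snp_tfbs (lincrna TEXT, snp TEXT, tf TEXT)")
    conn.executemany(
        "INSERT INTO snp_tfbs VALUES (?, ?, ?)",
        [
            ("ENSG00000237697", "rs112175570", "NF-kB"),
            ("ENSG00000237697", "rs112175570", "RelA"),
            ("ENSG_DEMO", "rs_demo", "TF_DEMO"),  # placeholder row
        ],
    )
    return conn


def quick_search(conn: sqlite3.Connection, field: str, value: str):
    """Return all (lincrna, snp, tf) rows whose `field` equals `value`."""
    if field not in SEARCHABLE:
        raise ValueError(f"cannot search by {field!r}")
    sql = (
        "SELECT lincrna, snp, tf FROM snp_tfbs "
        f"WHERE {field} = ? ORDER BY lincrna, snp, tf"
    )
    return conn.execute(sql, (value,)).fetchall()
```

The search value is always passed as a bound parameter, so only the whitelisted column name is ever formatted into the query string.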

Known disease SNPs in lincRNA TFBSs 

The SNP@lincTFBS database was developed not only as a
resource for identifying SNPs in putative TFBSs of human
lincRNAs, but also as a guide for predicting and subsequently
confirming novel disease-associated SNPs and lincRNAs. Previous
studies have found that lincRNAs tend to be associated with the
same diseases as the disease-associated SNPs within their TFBSs,
which act by affecting the expression of the lincRNAs [21]. We found 22 known
disease-associated SNPs in lincRNA TFBSs by searching previous
studies in PubMed (Table 1). For example, we found two
SNPs, rs2001844 and rs6982502, in two predicted TFBSs of the
lincRNA ENSG00000253111. These two SNPs were identified
as associated with variation in the magnitude of statin-mediated
reduction in total and LDL-cholesterol in a
genome-wide association study [32]; thus, this lincRNA might be
related to cholesterol-associated diseases. Further experimental
validation of the role of these disease-associated SNPs in
lincRNA TFBSs might provide new insights into the mechanisms
underlying human diseases.

Figure 3. The homepage and an example of SNP@lincTFBS database. Screenshot of the main search page and the corresponding result page
for a search with lincRNA ENSG00000177640.
doi:10.1371/journal.pone.0103851.g003

Table 1. Disease-associated SNPs in lincRNA TFBSs. (Surviving column headers: Disease or phenotype; PubMed ID.)

We also found several lincRNAs in SNP@lincTFBS that have
been reported to be associated with human diseases, and these
lincRNAs had SNPs within their putative TFBSs. For example, we
found the human lincRNAs NAG7, MEG3, PCAT1, CASC2 and
LINC00032, which are involved in nasopharyngeal carcinoma
[33], glioma and bladder cancer [34,35], prostate cancer [36],
endometrial cancer [37] and melanoma [38], respectively. We identified several
SNPs in the TFBSs of these disease-associated lincRNAs. These
SNPs might be potential risk SNPs for diverse diseases by
regulating the expression of disease-associated lincRNAs. For
example, research on the involvement of the NAG7 gene in human
nasopharyngeal carcinoma (NPC) susceptibility dates back
more than a decade, and previous studies have found that NAG7
plays a key role through both expression and interaction: it
can inhibit proliferation and induce apoptosis in NPC cells but
also stimulate NPC cell invasion [22,33,39]. Subsequently, the NAG7
gene was designated long intergenic non-protein coding RNA
312 (LINC00312) by the HGNC (HUGO Gene Nomenclature
Committee) [40]. Recently, a microarray-based investigation of the
possible correlations of LINC00312 expression with NPC
progression indicated that
LINC00312 was significantly down-regulated in NPC tissues and
could represent a potential biomarker for metastasis, progression
and prognosis in NPC [41]. In the SNP@lincTFBS database, we
found a SNP (rs112175570) located within the TFBS for the
transcription factors NF-κB and RelA in the promoter of the
LINC00312 gene (Ensembl ID: ENSG00000237697);
rs112175570 might be a potential risk SNP for nasopharyngeal
carcinoma by regulating the expression of LINC00312.

Besides cancer, we also found several SNPs associated with neurological or psychiatric
disorders in human lincRNA TFBSs. For example,
we found three SNPs (rs141600967, rs111946796, rs147394431)
in the TFBSs of the lincRNA ENSG00000214548 (also known as
MEG3), which has been demonstrated to be
associated with multiple human diseases, including glioma and
neuroblastoma [42,43]. We found three SNPs (rs2973034,
rs2973034, rs78670708) in the TFBSs of the lincRNA
ENSG00000248587 (also known as GDNF-AS1),
which has been demonstrated to be associated with
Alzheimer's disease [44]. In addition, we found an Alzheimer's
disease risk SNP (rs6472116, p = 9.59 × 10^-5) in a lincRNA TFBS
(ENSG00000253583) [45]. Further experimental
verification of this SNP might therefore provide novel insights and lead to new
treatments. Using our database, it is possible to
further investigate the mechanisms by which lincRNAs are involved in human
diseases.



Discussion 

Accumulating studies of dysregulated lincRNA expression in
diverse cancers have suggested that lincRNAs might act as
potential tumor suppressor genes and novel prospective therapeutic
targets in cancer treatment. SNP@lincTFBS is designed to
serve as a practical resource of SNPs in TFBSs that may dysregulate
the expression of human lincRNAs. The database provides
genomic information and annotations for SNPs in the
TFBSs of putative promoter regions of human lincRNAs, together with
a web-based interface that allows flexible querying and
downloading. Most human lincRNAs have TFBSs in their promoter
regions, and the distribution of SNPs in these lincRNA TFBSs is
widespread.

Previous studies have demonstrated that genetic variants in
the TFBSs of human lincRNA regulatory regions may change
lincRNA expression, thereby affecting susceptibility to
human diseases [21]. We therefore developed the SNP@lincTFBS
database, which is devoted to the exploration and annotation of
SNPs in potential TFBSs of human lincRNAs. One of the
distinctive features of SNP@lincTFBS is that all SNPs that can be
mapped to human lincRNA TFBSs are identified and annotated.
Other databases related to transcriptional regulation
of lncRNAs, such as ChIPBase [24], only collect TF-lncRNA
regulatory relationships that have been identified from ChIP-Seq
data. In SNP@lincTFBS, we considered not only the transcription
factors of lincRNAs (as in ChIPBase), but also the SNPs that affect
the ability of each transcription factor to bind to lincRNA promoter
regions.

Our database has the potential to become a valuable resource
for further studies of lincRNA function and complex disease. For
example, we found several disease-associated SNPs and lincRNAs
in SNP@lincTFBS, suggesting the potential application of
SNP@lincTFBS in the field of disease-associated lincRNA
variants. We found multiple SNPs in the TFBSs of cancer-associated
lincRNAs; further experimental verification of these
disease candidates might yield novel insights into disease
pathophysiology. In addition, we found multiple SNPs in
the TFBSs of lincRNAs associated with neurological or psychiatric
disorders. This finding is consistent with previous studies, which
revealed that lincRNAs play important roles in the brain [5] and
in neuropsychiatric disorders [46]. Although the current number is
limited, with the growth of interest in human lincRNAs and the
availability of high-throughput technologies, the total number of
disease-associated lincRNAs and SNPs will undoubtedly continue
to grow, and SNP@lincTFBS will become increasingly useful in future
studies.

In the future, we envisage the database to be available as a 
semantically linked interoperable data resource. We hope that